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ABSTRACT 

There is an overwhelmingly large literature and algorithms already available on ‘large scale 
inference problems’ based on different modeling techniques and cultures. Our primary goal 
in this paper is not to add one more new methodology to the existing toolbox but instead 

(a) to clarify the mystery how these different simultaneous inference methods are connected, 

(b) to provide an alternative more intuitive derivation of the formulas that leads to simpler 
expressions in order (c) to develop a unified algorithm for practitioners. A detailed discussion 
on representation, estimation, inference, and model selection is given. Applications to a 
variety of real and simulated datasets show promise. We end with several future research 
directions. 

Keywords: Mixed-sample problem, False discovery rate; Comparison density; RKHS; 
Smooth p-value; Skew-beta density decomposition; Pre-flattening smoothing; Tail modeling. 

1 Introduction 

1.1 Mixed Sample Inference 

Consider n i.i.d. observation Z\, ... , Zn from the continuous distribution F = 7r 0 F 0 + (1 — 
7r 0 )G, 0 < 7r 0 < 1, where F 0 is specified distribution (known) and G is unknown mixing 
distribution. As the given random sample may contain observations from both F 0 and G, 
we call this framework a ‘mixed-sample problem.’ This can be considered as an intermediate 
statistical inference problem between two extremes: one and two sample inference problems. 
Typical goals of mixed data inference problems include: (1) identifying the Zf s coming from 
the distribution G (also known as signal detection), and (2) estimating the proportion 7r 0 . 
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Motivation. A multiple hypothesis testing problem can be thought of as a mixed-sample 
problem by considering Zf s to be the test statistic corresponding to a null hypothesis H 0i 
(i = 1,..., N ), where the goal is simultaneous inference. Our main motivation is to develop a 
comprehensive modeling framework for big-IV signal detection (or large-scale mixed-sample) 
problems in a way that is analogous to the one-sample and two-sample approaches so that 
we can cover the whole statistical inference spectrum (one sample —> mixed sample —» 
two sample) using one general concept and tool - which will simplify theory, practice and 
teaching. 


1.2 Goals and Organization 


The topic of multiple hypothesis testing, which began with the pioneering work of Tukey 
on “Comparing individual means in the analysis of variance” published in 1949 Biometrics, 
gained new momentum in the 21st century with modern high-throughput data acquisition. 
Since then an enormous amount of research has been conducted on this topic. Following is 
a summary of our research contributions: 


(1) We seek to condense and systemize the vast literature on large-scale simultaneous in¬ 
ference methods based on a few fundamental concepts and notations introduced in Section 
2. The author believes that this could ultimately help applied data scientists to gain bet¬ 
ter insight for selecting the proper signal detection tool from the existing large inventory 
by simplifying the practice. Section 2.2 describes a unified representation theory and an 
alternative rationale of three main classes of large-scale inference procedures: Benjamini 


Hochberg’s false discovery rate (FDR) approach (Benjamini and Hochberg, 1995), Efron’s 


empirical Bayes local FDR proposal (Efron et al. 2001), and Donoho Jin’s Higher Criticism 


thresholding (Donoho and Jin, 2004). 

(2) Section 2.2 further shows how this new theoretical framework for large scale simultaneous 
inference problems can integrate and reconcile frequentist and Bayesian cultures, bypassing 


the old philosophical and ideological differences (Benjamini, 2008| to develop unified efficient 
algorithms using single concept and notation. 


(3) The theory presented in Section 2 motivated us to introduce a new class of nonparametric 
signal detection algorithm by converting the large-scale inference problems into a function 
estimation problem based on the ideas of comparison density and distribution (CD); see 
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Web Appendix A for more discussion on the historical significance of the current research. 
Section 3.1 describes the statistical modeling challenges. A comprehensive CD-based non- 
parametric modeling solution for ‘signal-hunting’ procedure is described in Section 3.2-3.5, 
which consists of three main components: (A) We discover that comparison density takes a 
‘universal stylized shape’’ for almost all large-scale inference problems - ‘U’ shaped density 
over the compact unit interval. To tackle this nonstandard density we propose a specially de¬ 
signed Skew-Beta decomposition-based nonparametric estimation algorithm; (B) To model 
empirical null we propose a robust biweight M-estimator, which is easy to compute; (C) 
Finally, we describe the Minimum Deviance Criteria (MDC) algorithm for estimating true 
null proportion. Section 4 presents numerical results, and discussions are given in Section 
5. 


2 Theory 


2.1 Background on Functional Statistical Inference 


Let Z\, Z 2 ,..., Zn be a mixed random sample, with the majority of the observations coming 
from continuous null distribution Fq, and a small proportion from unknown contaminating 
distribution G: F = tt 0 F 0 + (1 — 7t 0 )G, 0 < 7T 0 < 1, where Z, ’s are the corresponding test 
statistic for the hypothesis testing problem and the goal is simultaneous inference. Here 
we will provide an intuitive introduction to the functional statistical approach for large-scale 
simultaneous inference problems whose goal is to detect false null hypotheses. 


The cornerstone of our approach is based on the concept of comparison density (Parzen 


1983). For continuous F and G, define comparison distribution function D(u;G,F ) : = 


F{G 1 (u)) and the corresponding comparison density by: 


d(w, G , F) = 


nG-\u)) 


0 < u < 1. 


( 2 . 1 ) 


Under Hq : F = Fq, i.e., when all the observations actually come from Fq, the theory 
of weak convergence of comparison distribution process ( [Parze n, 1999, Thas, 2010) tells 
us VN(D(u ) — u) -4 B (u) as N —* oo, where B(u),0 < u < 1 is the Brownian bridge 
process, and D(u) = F(F 0 -1 (w)), where F(z]Z) = N^ 1 Yl!i=i^{Zi < z ). In mixed-sample 
problems, our goal is to identify the signals or the ZJ s that are coming from G. If all 
the null-hypothesis H 0 i,H 02 , ... , H 0N are true (i.e., under H 0 : F = F 0 ), we would expect 
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D(u) := D(u; F 0 , F) ~ u. Thus, intuitively, the collection of u' s, for which the distance 
between D(u) and u are large, should be suspected as potential signals. In other words, 

{u : D{u)/u > 7 }, 

for a suitably selected threshold might contain the signals or the rejected null hypotheses. 
Now instead of comparison distribution function D(u), in the same spirit, we can develop 
a similar strategy for detecting false null hypotheses based on comparison density d(u). So 
the performance of simultaneous inference critically depends on the fundamental comparison 
density (or equivalently distribution) function. 

From Multiple Testing to Nonparametric Function Estimation. The crux of this section is 
this: the notion of comparison distribution and density allows us to transform the simulta¬ 
neous hypothesis testing problem into a nonparametric function approximation problem - 
a functional statistical approach to large scale inference. In the next section, we will make 
this intuitive mixed-sample signal detection approach formal and rigorous. 


2.2 Unified Representation and Theory of Threshold Selection 


The previous section intuitively argued how to build a CD-based simultaneous inference 
algorithm. The purpose of this section is threefold: (1) to derive proper thresholds for CD- 
based simultaneous inference procedures that can lead to desirable methods with provable 
guarantees; (2) to show how the Frequentist and Bayesian large-scale inference algorithms 
can be connected using the theory of reproducing kernel Hilbert space (RKHS) of the limiting 
Brownian bridge process of the comparison distribution; (3) to simplify the currently existing 
vast literature on large-scale multiple testing by providing unified representation based on 
a single concept and notation - comparison density/distribution, to our knowledge largely 
an unexplored area of research. 


Let’s start with the question of how to select the proper threshold. The answer depends on 
the appropriate error rate that we seek to control in multiple hypothesis testing. Based on 


particular set up and purpose of any specific inference problems Benjamini (2010) provided 
a list of 14 such error rates. 


One of the important error-rate criteria (among many possibilities) in large-lV multiple 
testing is to control the false discovery rate (FDR) - the expected proportion of Type I 


errors among the rejected hypotheses, suggested by Benjamini and Hochberg (1995). The 
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following theorem describes how to select the threshold for comparison distribution-based 
method that can guarantee FDR < a (for some preselected significance level a) under any 
arbitrary hq and G. A sketch of the proof is provided in Web Appendix B. 

Theorem 1. Consider testing N independent null hypothesis H( 0 i),..., H( 0 n) based on cor¬ 
responding ordered p-values itm,..., u^n)- Define the index set 

lZ=\i<k\ k = argmaXj (2.2) 

l OL J 

Then the procedure that rejects H^fi G 1Z controls FDR at the level a, regardless of the 
distribution of the test statistic corresponds to false null hypothesis. 


To see how CD-based algorithm (2.2) essentially equivalent to the (adaptive) Benjamini 
Hochberg’s (BH) FDR controlling procedure, note that D(up, F 0 , F) = FFfi 1 ^^)) = i/N , 
which is equivalent to saying that reject i = 1,..., k, where 


X OL 

k = max {1 < i < N : ui n < — — ). 

1 “ “ ~Ntiq 


RKHS-based dual representation. Here we will give an alternative CD-based FDR controlling 
procedure motivated by the isometric isomorphism (one-one inner product preserving map) 
relation between the Hilbert space spanned by the Brownian bridge process (which is the 
limit process of \/N ( D[u ) —u) and the corresponding RKHS associated with the covariance 
kernel A B (u, v). The following theorem provides the complete characterization of the RKHS. 

Theorem 2. The reproducing kernel Hilbert space Hx associated with the Brownian bridge 
covariance kernel Km{u,v) = min(-u,v) — uv, 0 < u, v < 1 consists of L 2 differentiable 
functio7is with the inner produce = fy <fi(u) d u that satisfies (f)(0) = 0(1) = 0. 


It is sufficient to prove the reproducing property relative to the inner product in the following 
sense, for all 0 G H K 

(K m (u, -),0) = 0(u), for any u G [0,1]. (2.3) 

For details of the proof see Web Appendix C. 

Moving toward Empirical Bayes Formulation. As the RKHS associated with the Brownian 
bridge covariance kernel equipped with the norm squared ||h|| 2 = J 0 ' \h!(u)\ 2 d u, this tells 
us that the other bonafide signal detection technique can equivalently be constructed by 
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looking at the distance between D'(u) = d(u ) and 1. That is, the collection of n’s for which 
d(u;F 0 ,F ) substantially deviates from Uniform[0,1], are precisely the non-null candidates 
that we are searching. A comparison density d{u\ Fq, F) based signal characterization (in¬ 
stead of comparison distribution function D(u\ F 0 ,F) is given in the following result, which 


turns out to be an alternative way of expressing Efron’s empirical Bayes local fdr (Efron 


et al. 2001) formula. 


Theorem 3. Local false discovery rate can alternatively be represented using the comparison 
density 


fdr(z) := Prjnull | Z = z} = 


TTO 


d(F 0 (z)- } F 0 ,Fy 


(2.4) 


The local fdr is defined as the conditional probability of a case being null or noise given 
Z = z, 


fdr(z) = Pr(null | Z — z) = 7T 0 


fo(z) 
f (z) 


= vr 0 / d(Fo(z)‘, F 0 , F). 


(2.5) 


ft is known that the local FDR is more conservative than the BH procedure. To address 
this issue, 


Efron 


(2007) recommended the threshold {u : d(u;F 0 ,F) > Tbh/ 2} (under the 


Lehmann alternatives), where Tbh = vr 0 /a, which he argued controls size, or Type I errors at 
the desired level. The goal of local FDR algorithm is to estimate the conditional probability 
Pr(null | Z = z) from the data. The traditional procedure estimates separately 7r 0 , f 0 (t) 
and fit), while our comparison density-based approach directly estimates the probability; 
see Web Appendix D for more discussion on this point. 

Relation with Higher Criticism. Another related method for detecting rare/weak signals in 
large-scale experiments is “Higher Criticism” (HC) thresholding (which is not primarily de¬ 


signed to protect against false discoveries) introduced by Donoho and Jin (2004) generalizing 


Tukey’s proposal (Tukey 1989). Following is the equivalent representation of HC-procedure 


using our notation. Reject i — 1,..., k where 


k = < 1 < i < a 0 N : argmax^ 


^(i) 


( 2 . 6 ) 


\/“(•)(! “ “(•)) ’ 

where a 0 = .1 is a common choice. We can study the asymptotic properties of HC procedure 
using the limiting unit variance Gaussian process of a/ w{u ) ( D(u ) — u), w(u ) = l/w(l — u ) 


with the corresponding covariance kernel K(u,v ) = 


Jin (2004) for further details. 


min(u, v ) — uv 
\/u( 1 — w)u(l — v) 


; See 


Donoho and 
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A New Class of Nonparametric Large-scale Inference Procedure. The representation results 
proved in this section allow us to develop a new class of CD-based approaches by specifying 
different models and estimation strategies for d (or equivalently D(u) = f ( " d(v) du). For 
example, the raw-empirical comparison distribution function D generates the BH procedure 


(2.2) or the HC procedure (2.6), which can be upgraded to a more enhanced technique by 
plugging in smooth nonparametric estimate D{u ) = J Q U d(v) dv. 


Nonparametric Function Approximation. We argue comparison density is an indispensable 
tool that provides the required theoretical foundation to develop a single general algorithm 
for modern large-scale signal detection problems. The CD-based approach unifies the fre- 
quentist BH procedure and empirical Bayes local false discovery concepts by converting the 
simultaneous inference problem into a nonparametric function estimation problem. 


Any comprehensive multiple testing algorithm should have three main components: (1) 
7 r 0 (null-proportion), (2) F 0 (empirical null distribution), and (3) d(u;F 0 ,F ) (comparison 
density). The efficiency of the signal detection procedure directly depends on the quality 
of estimation of these quantities. In the next section, we will describe a nonparametric 
estimation algorithm (CDfdr) in a step-by-step manner (along with real data illustration) 
to achieve this goal. 


The following section starts with an interesting observation that virtually all large-scale 
inference problems generate comparison density with a very special universal shape (see 
Web Appendix K for more discussion on the tail-behaviour of mixed-sample comparison 
density). It turns out that the traditional nonparametric density estimation methods fail 
to satisfactorily capture this typical shape of the comparison density over a compact [0,1] 
support. For that purpose, we develop a specially designed nonparametric algorithm (tailored 
for this particular characteristic shape) for parsimonious comparison density estimation d, 
which will be used for detecting signals or false null hypotheses and to estimate the null 
proportion 7r 0 . 
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3 Estimation 


3.1 Two Modeling Challenges 


cancer 


Fig [T](A) shows the two-sample t-test p-values of iV = 7129 gene expressions of Golub 
data (Golub et ah, 1999) while comparing ri\ = 27 acute lymphoblastic leukemia (ALL) and 
n 2 — 11 acute myeloid leukemia (AML) tumor samples to identify differentially expressed 
genes (data available in R package golubEsets). The comparison density, which is the 
distribution of p-values from this large-scale study indicates two nonparametric modeling 
challenges: First, we need density estimators for compact support [0,1]. Second, the sharp 
narrow peak near the boundary 0 necessitates modeling of the highly dynamic tail. 


Developing nonparametric density estimation for compact support [0,1] is known to be 
a challenging problem due to the “boundary effect.” Recently, there has been a great 


deal of interest in adapting kernel density estimator for unit interval (see Geenens (2014) 
and Wen and Wu (2014)) by transforming the variable of interest into another one whose 
density estimation should be free from boundary problems, and then finally transforming 
that estimate back into the initial scale. However, it has been recognized that these methods 
are computationally not efficient. Other approaches, like regression-based density estimators 
via smoothing splines or local polynomials, are known to have a larger variance near the 


boundaries (Thas, 2010). 


For large-scale signal detection problems, besides compact support, the other more challeng¬ 
ing modeling aspect is to tackle the highly dynamic tail near the boundaries 0 and 1, where 
all previously mentioned density estimators perform poorly. We will propose a specially tai¬ 
lored new genre of density estimator to tackle this typical non-standard “U” shaped density 
on the unit interval. 


We do note, however, that, it is not difficult to propose new density estimation techniques 
to fit the data, such as Fig [l](A) , but it is less trivial to come up with a sparse parametric 
model that fits the data well and is easy to interpret. The most stunning fact about our 
approach (that we will elaborate on the next section) is that it require only three parameters 
to accurately model the golub p-values, including the tail region! 

To model this typical shape of comparison density (that arises in the large scale signal 






















detection problems), we prefer estimators that are simultaneously accurate, computationally 
simple and parsimonious. Our proposal consists of two main steps: (1) converting “spiky” 
p-values to “smooth” p-values via the preflattening technique as described in Section 3.2 
and (2) estimating “smooth” p-values by expanding preflattened d(u) (nonparametric L 2 
or orthogonal series density estimator) or log d(u) (maximum entropy exponential density 
estimator) as a linear combination of orthonormal shifted Legendre polynomials (whose 
support is [0,1]), described in Section 3.3. The novelty of our approach lies in its unique 
ability to “decouple” the density estimation problem into two separate modeling problems: 
the tail part and the central part of the distribution. 

[Figure 1 about here.] 


3.2 Skew-Beta Density Estimator: Tail Modeling 


We propose a new genre of nonparametric density estimation technique, which starts with 
a parametric model g(x) and permits the following universal decomposition-. 


f(x) = g(x) x d[G(x)\G,F]. 


(3.1) 


Verify comparison density (2.1) evaluated at G(x) has the form d(G(x);G, F) = f(x)/g(x). 
We call this new class as Skew-G density model for fix). 


To efficiently capture the typical shape of the p-values shown in FigfIjA), we select g(x) on 
[0,1] in such a way that it can tackle the rapidly changing tail. Hence, beta distribution is 
a good candidate for g(x), which can act as pre-flattening function, as shown in Fig[l](B,C). 
The resulting skew-beta comparison density estimator has the following decomposition: 


d(u]F 0 ,F) 


f B (u-,a,/3) x d[F B (u-a,P)- F b ,F], for 0 < u < 1, 


(3.2) 


where beta density and distribution is denoted by f B and F B . We define the quantity 
F b (u] a, /3) as the “smooth” p-value, shown in Fig[l](C) , whose density is d[F B (u; a, /3 ); F B , F], 
0 < u < 1. Use simple maximum-likelihood estimate a and j3 to get the fitted beta; see 
Web Appendix E for more discussion and simple one line R implementation. Note that if 
none of the genes were differentially expressed for Golub data then we would expect the 
distribution of the pvalues to be Beta(cc = l,/3 = 1), i.e., uniform. This simple observation 
can lead to a quick diagnostic for testing the presence of signals. Web Appendix F describes 
the procedure. 
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3.3 Estimating density of “Smooth” P-values 


In this Section, we develop a nonparametric density estimator for smooth (Beta-transformed) 
p-values d(v,F B ,F) on the unit interval, where v = F B (w,a,/3). Our proposal is based on 
L 2 or maximum entropy exponential comparison density estimator by expanding it in an 
orthonormal basis. We select shifted orthonormal Legendre Polynomials (LP) Leg^n), 
j — 1, 2,... as basis, which form a complete orthonormal basis for L 2 [0,1] Hilbert space of 
square-integrable functions. 


L 2 orthogonal series estimator: d(v; F B , F) — 1 = JA LP \j] F B , F] Leg^n) (3.3) 
Maximum entropy estimator: log d(v;F B ,F) = 9j Leg^u) — K(0) (3.4) 

Due to simplicity and computational ease (which will be clear soon) we use L 2 estimate, 


which has nice asymptotic theoretical properties (Anderson and de Figueiredo, 1980). The 
large -N paradigm (for golub data N = 7129) makes all of these asymptotic analyses very 
much relevant and directly applicable. However, the reader is free to use the nonparametric 
maximum entropy exponential estimate (3.4). 

Computation of LP-Fourier Coefficients. The L 2 orthogonal expansion coefficients satisfies 

i 

LP \j-F B ,F] = f d(v;F B ,F) x Legj(y) dv. (3.5) 


Substitute v by F B (u ; a, j3) and note 


d[F B (ir,a,(3);F Bl F] = 


d(u ; F 0 , F) 
f B (u-,a,P)' 


This lead to the following representation: 


LPbm.-F] = / d(u;F 0 ,F) Legj {F B (u;a,/3)} du. 


(3.6) 


Eq. (3.6) implies that the LP-coefficients can be represented as the mean of Leg^ evaluated 
at the F B (u;a,f3) (these are Beta-transformed smooth pvalues). This representation result 
is summarized in the following theorem. 


Theorem 4. The Fourier-LP coefficients admit the following representation 

i 

LP[j; F b , F] = j Le gj (v)dD(v,F Bj F) = E[L e gj {F B (U-,a,P)}]. 
o 


10 







As a practical consequence, this result allows fast computation of the LP expansion coeffi¬ 
cients as mean of the Leg^ score functions evaluated at the transformed smooth pvalues: 

N 

LP[j; F b , F] <r- N-^Legj [F B ( Ui -a, P)\. (3.7) 

2—1 


Adaptive Estimation. Estimate the smooth nonparametric model by selecting the ‘signifi¬ 


cantly large’ LP-coefficients in a data-driven way. We will use the Ledwina (1994) scheme 
using Schwarz selection criterion, which is known to be consistent. To identify the important 
coefficients using Ledwina’s data-driven model selection criterion, we first rank the squared 
LP-coefficients. Then we take the penalized cumulative sum of k coefficients using informa¬ 
tion criterion IV -1 log(N)k and choose the k for which this is the maximum. Fig[l](D) shows 
the resulting smooth estimate for the golub data. 


3.4 Estimating Proportion of True Null Hypothesis 

In Section 2.2 we have seen the thresholds for CD-based false discovery methods depend 
on the parameter no, the true proportion of noise or null hypothesis. The following is our 
proposed data-analytic estimation algorithm. 

Algorithm 1 [ 7 r 0 estimation by Minimum Deviance Criteria (MDC)] 

Step 1. Define U\ = {tq : d(u;F 0 ,F) < A}, where d is the estimated beta-preflattened 
nonparametric comparison density; \U\\ = N\. For each fixed A perform the following steps. 

(a) Compute LPa [j] <— Leg^tq), which is the score coefficient for the following 

L 2 comparison density d\{u) = 1 + LPa[j] Legj('u), based on U\. 

(b) Calculate the deviance statistic I\ •<— Xqli |LPa[ j] | 2 - 

Step 2. Display the deviance path (A, I\) on a fine grid of A G [1, 7 ] and set A* arg rniiiA I\. 
Step 3. Output n 0 = N\*/N. 

The rationale behind the algorithm comes from the simple fact that under H 0 , when all the 
cases are null, the underlying comparison density d{u\ F (j , F) should not deviate much from 
Uniform[0,l] as D{u] F 0 , F) = u. Parseval’s theorem dictates the following equality for L 2 
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comparison density: 


[d(u-,F 0 ,F)-l] 2 du = ]T|LP[j;Fj,F]| 2 . 


(3.8) 


In the light of Eq. (3.8) the statistic I\ can be interpreted as the deviation of d\(u) from 
uniformity. Fig 1 of Web supplement illustrates this idea for Prostate cancer dataset (de¬ 
scribed in Section 4), where the shape of the estimated comparison density clearly indicates 
the presence of signals in the two tails. The deviance path for the rejection region of interest 
1 < A < 7 = 3.5 is shown in the right panel for M = 10, which gives arg min A I x = 1.98 and 
7r 0 = 0.971. At the point A* = 1.98 the deviance statistics takes the minimum value, which 
implies that the set U.\=\m is most likely consists of the null cases. 

Connection with other approaches. Instead of working with comparison density, there are 
related methods based on comparison distribution function to identify p-values that differ 
significantly from the null Uniform[0,1]. Our method could be viewed as the formal density 


analogue of the graphical method proposed in Schweder and Spjptvoll (1982). Another 


similar method proposed by Storey (2002) estimates 7 r 0 (A) = (1 — D( A))/(l — A). They 
proposed a computationally intensive method to select the tuning parameter A. It consists 
of bootstrapping p-values for each A in the range {0, .05,..., .95} and selecting the one which 
minimizes mean-squared error (MSE). 


3.5 Estimating Empirical Null 


Until this point we have assumed that the continuous null distribution Fq is completely 


specified. As noted by Efron (2010), “This is a less-than-usual occurrence” for real datasets. 


Efron (2004) proposed a method for estimating the unknown location and scale parameters 


under the zero assumption (i.e., the non-null or the mixing distribution (signal) g(z) = 0 in a 
certain interval around the z = 0, which is known to be prone to bias) for F 0 = <h[(x—po)/co]- 


Muralidharan (2010) estimates by putting a Dirichlet(/3) prior penalty on the 7 r, where (3 


is an extra tuning parameter. Here we propose a simple and fast solution that works quite 
nicely. 


In the event all the observations Z 1; ..., Z^ are coming from F 0 , we have F(z) = F 0 [(z — 
ho)/ cr o] ! where F 0 is the theoretical location-scale null distribution. We could have easily 
estimated the unknown po and Co just by fitting a simple linear regression model in the 
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QQ plot. But instead, we have few Z t 's, which are coming from G (7r 0 -contamination 
neighborhood of Fq). If we knew which of these were unusual observations we could have 
removed them before fitting the linear regression. But as this is not the case, the natural 
step is to fit robust linear regression instead of linear least square, which is resistant to the 
data contamination. To get the estimates, solve the M-estimating equation: 


YTi =1 F 1 (u i ) ~Ho~ cr 0 F 0 1 (u i ) 


= 0, 


Efc!*? F '(“i) -Mo -Oofo Vi) F 0 '(«() = 0, 


(3.9) 


where we choose Ty to be the Tukey’s biweight influence function (Beaton and Tukev 1974) 
given by 

z[ 1 — ( z/k ) 2 ] 2 for \z\ < k , 

0 for \z\ > k. 


^t(z) — 


The value k = 4.685 is usually used as a default choice that provides 95% asymptotic 
efficiency under normality and still offers protection against outliers. A straightforward R- 
implementation is possible via MASS library function rim; see Web Appendix G for R-code. 
Table [l] reports the findings of our approach for five large-scale studies discussed in Chapter 


6 of Efron (2010). Finally, produce the null-adjusted uonparametric comparison density 


estimate using F 0 := F 0 - 


MO,0- o 


d(u;,F 0 ,F) = ./ R (/■;,: a,$) x ,7 (/'r !/'),; a,/?); /'r. /■') 


(3.10) 


[Table 1 about here.] 


4 Examples 


4.1 Real Data Application 


Prostate cancer data (Singh et ah, 2002) consists of 102 patient samples (50 labeled as 


normal and 52 as prostate tumor samples) and 6033 gene expression measurements. We 
aim to detect interesting genes that are differentially expressed in the two samples. For this 
purpose, we compute the two-sample t-test statistic t t for each gene and convert them into 
z-scale by z t < fA 1 (7ioo( i b))> where T denotes the t-distribution function, shown in Fig 
2(A) of Web Supplement. Fig 2(D) of Web Supplement shows the final beta-preflattened 
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smooth estimate of comparison density given by 


d(u; <L, F) = .68 [l + 0.057Leg 6 (F b (w; a = .81, (3 = -82))] u 19 (1 — u) 18 , 0 < u < 1, 

(4.1) 

which along with the Minimum Deviance Criteria gives 7To = -971. The representation result 
(2.5|) given in Theorem 3 immediately provides a CD-based estimate of local fdr (which we 


call CDfdr). We compare our result with Locfdr and Mixfdr (Muralidharan, 2010) that 
estimate the fdr by separately estimating the numerator /o and the denominator / (see Web 
Appendix D) . Naturally there are many variants available for these two methods depending 
on the way they estimate null and marginal densities. We have used the R package locfdr 
and mixfdr for implementation purpose. Locfdr estimates pool destiny / using splines. 
Methods for estimating null include the following: (a) theoretical (A/"(0,1)); (b) maximum 
likelihood (MLE); (c) central matching (CM); (d) split-normal (SN). Mixfdr implement J 
group normal mixture model for /. Estimation of empirical null involves putting Dirichlet 
prior on mixing proportion. We have used the default choice of J and Dirichlet parameter 
P throughout. All the three empirical null methods, including ours (reported in Table 1) 
shows that the theoretical null is not drastically different from 1). A very slight scale 
correction was required. 


[Figure 2 about here.] 

Each the empirical null based estimated fdr functions, including ours, shows strong similarity 
in the tail-regions. The behavior of the fdr function in the tail-region (compared to the 
central region) is most crucial, as this determines which p-values will be declared as false or 
non-null. So, it might be more appropriate to focus on the quality of “tail-modeling” instead 
of considering the entire curve. We carry out this in the next section where we carefully 
quantify the estimation accuracy, especially in the tails. 

The number of non-null genes identified by different methods, along with the estimates of 
the proportion of true null is provided in the Web Supplement Table 2. First note that the 
estimates of no from Locfdr-CM and Locfdr-SN are unrealistic as they exceed 1. Number of 
non-null genes identified using CDfdr matches with the Locfdr-SN and Mixfdr-Emp method. 
Furthermore, our proposed minimum deviance estimate 7To is very close to the Mixfdr-Th, 
Mixfdr-Emp and Locfdr-Th methods. 
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4.2 Simulation Study 


In order to further evaluate the accuracy of the CDf dr algorithm, we perform two simulated 
experiments. We are mainly interested in investigating how accurately different methods 
estimate fdr, especially in the tail-regions based on the mean integrated square error (MISE) 


criteria. Comparisons will be done with Locfdr, Mixfdr and Fdrtool (Strimmer, 2008). 


Grenander density estimation is used in Fdrtool for estimating the unconditional density / 
and is implemented in R package Fdrtool. 


4.2.1 Mixture Normal 


We simulate T) ~ 1), i — 1,..., N = 5000 out of which 4500 fi l is set to zero. The 

remaining 500 is drawn from (once and for all) W(/i, 1). We estimate the fdr for various 
methods and repeat the whole process 150 times for /i = 0.2, 0.5,1, 2. Our setup closely 


follows Storey (2002) and Muralidharan (2010). 


Local fdr estimation. The goal is to investigate the efficiency of different estimation methods 
when we fix the null density at J\f( 0,1). Web Supplement Figure 3 depicts the expectation 
and standard deviation of local fdr for various methods under four different choices of p. For 
H — 2, when the signal and noise are well-separated, all of the methods perform equally well 
apart from Fdrtool, which not only shows high bias but has large variability in the crucial 
tail region. Under the more difficult scenario of p = .2, clearly CDfdr is the only method 
that can claim to be unbiased. The variability of Mixfdr and CDfdr seems similar, though 
in the extreme (right) tail, CDfdr shows more stability. If we look at the Sd(fdr) curves 
for CDfdr and compare with other competing methods, it appears to be the least variable. 
Also, CDfdr achieves near unbiasedness irrespective of the underlying signal strength , which 
makes it a reliable tool for large-scale inference problems. 


Null proportion estimation. It is interesting to examine how all of these methods would 
perform in estimating the true null proportion under different degrees of signal strength. 
We implemented our Minimum Deviance Criteria (MDC) algorithm to estimate n 0 . The 
simulation was done under the same experimental setup. The boxplots of estimates of tiq 
under different non-null densities are shown in Web Supplement Figure 4. Locfdr shows the 
largest variability. Mixfdr certainly performs best among the competing methods. However, 
the method that was particularly successful for estimating the true null proportion i r 0 = .9 
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quickly and accurately is based on the our proposed MDC. This makes the CDfdr algorithm 
more powerful and efficient as a signal detection tool that protects against false discovery. 

4.2.2 Mixture Uniform 

We generate p-values from the following model with the parameter choices: tiq = {0.9, 0.95, 0.99} 
and a = {0.02,0.002}: 

7r 0 Uniform[0,1] + (1 — 7T 0 ) Uniform[0, a]. (4.2) 

Here we particularly pay attention to the tails and for that we consider the tail-specific 
MISE criteria E J 5 (fdr(w) — fdr(w)) 2 d u, where S denotes the collection of u coming from 
the alternative model U[0, a]. The goal is to quantify how precisely the fdr is estimated for 
the signals (in the tail). Here the parameter a controls the signal strengths and parameter 
7 Tq determines the underlying sparsity levels. 

[Figure 3 about here.] 

Our simulation design covers the complete spectrum from 

dense and weak —y rare and weak —y strong and dense —> strong and sparse signal. 

In the presence of weak signals (a = 0.02), the Locfdr and Mixfdr show a great deal of 
variability, as shown in Fig [3] (first row). CDfdr maintains the smallest tail-specific MISE 
among all the methods, which again ensures its utility. For strong signals (second row of Fig 
|3j) , most of the methods have reasonable performance, except perhaps, (a = 0.002,7r 0 = 0.9) 
case where the Locfdr poorly approximates the tail. The large number of outliers for Fdrtool 
are also not desirable. 

Overall, it is encouraging that CDfdr adapts to the underlying signal sparsity and strength in 
many cases , which makes it very attractive and reliable for large-scale studies. Undoubtedly 
for the examples we have discussed in this paper it appears that CDfdr shows the most 
consistent and robust performance. 

5 Conclusion 

We have shown how the concepts and notations of comparison density and distribution (1) 
provide unification (and an alternative rationale) of the two cultures of multiple testing: 
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Frequentist BH procedure and Efron’s empirical Bayes local fdr idea; and (2) lead to a 
new class of efficient nonparametric signal detection algorithms. Our approach converts the 
mixed sample signal detection problem into a nonparametric comparison density function 
estimation problem. The author expects proposed CD-based nonparametric functional sta¬ 
tistical approach will not only simplify the theory (which has an enormous literature) and 
practice but will also allow the possibility to include large-scale inference topics as part of a 
conventional statistical inference curriculum that can cover one-sample, mixed-sample and 
two-sample problems using single concept, notion, and algorithm. Our unified treatment by 
linking different large-scale multiple testing ideas (rather than presenting them as “unrelated 
tools”) conld accelerate the learning for students. 

We made a crucial observed that almost all large-scale inference problems produce compar¬ 
ison density with a typical stylized shape - “U” shaped density over the compact interval 
[0,1], which is not easily estimable using traditional nonparametric density estimation tech¬ 
niques. To address this, we introduced a new genre of density estimation methods based on 
the idea of Skew-G density decomposition. This allows for richer data-driven specification 
for tail-modeling via simple parametric models, which has an added advantage of being in¬ 
terpretable and easily implementable. We believe this technique can be used in many other 
modeling problems (outside multiple testing), such as heavy-tailed density estimation. 


In future work, we would like to extend this technique to discrete mixed sample problems. 


Although we believe the theory presented in Mukhopadhyay and Parzen (2014) will guide us 
in that direction, it will require more detailed studies. Systematic theoretical investigation of 
the asymptotic properties of the proposed nonparametric skew-beta density estimator is an 
unexplored and open problem. Additionally, we plan to understand the effect of dependence 
on our proposed density estimator and how it influences the false discovery thresholds. 


6 Supplementary Materials 

Web Appendices, Tables and Figures referenced in Sections 1.1, 2.2, 3.2, 3.4 and 4 are 
available with this paper at the Biometrics website on Wiley Online Library. Online link: 
http://bit.do/LSSD-BiometricsSupp, 
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Golub data: p—values 


Fitted Beta 





Figure 1: The concepts of ‘smooth’ p-value and the mechanism of skew-Beta density esti¬ 
mation for Golub gene expression data is illustrated; (A) [Top Left] Histogram of the 7129 
p-values of using two sample t-test; (B) [Top Right] Fitted beta distribution Beta(d = 
.32, = .75) to the p-values; (C) [Bottom Left] Distribution of the smooth p-values 

v = Fb(u ; a = .32,/3 = .75); (D) [Bottom Right] LP-adaptive orthogonal (Nonparametric) 
density estimator of the smooth Golub p-values: d(v; F B ,F) = 1 — .16 Leg 3 (u), 0 < v < 1. 
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Comparing estimated fdr for various methods 
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Figure 2: (color online) Estimated local fdr is shown for seven different methods for Prostate 
cancer data. 
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Figure 3: Compares the tail specific MISE for Uniform mixture model 7To Uniform[0,1] + 
(1 — 7To) Uniform[0, a]. The rows corresponds to a = 0.02 and 0.002. For each row the 
columns (from left to right) denotes 7T 0 = 0.9, 0.95, 0.99. 
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Table 1: Comparing three empirical null estimation methods on five large scale data sets. 
Readers are referred to Appendix B of Efron (2010) for more information on the data sets. 


Examples 

Methods 

ho 

So 


Robust biweight M-estimator 

.121 

.845 

HIV 

MLE (Locfdr) 

.115 

.753 

Penalized mixture model (Mixfdr) 

.131 

.838 


Robust biweight M-estimator 

.383 

1.25 

Gene-Tagging 

MLE (Locfdr) 

.310 

1.26 

Penalized mixture model (Mixfdr) 

.373 

1.23 


Robust biweight M-estimator 

.017 

1.88 

Leukemia 

MLE (Locfdr) 

.120 

1.59 

Penalized mixture model (Mixfdr) 

.017 

1.86 


Robust biweight M-estimator 

-0.001 

1.092 

Prostate 

MLE (Locfdr) 

-0.002 

1.087 

Penalized mixture model (Mixfdr) 

-0.002 

1.068 


Robust biweight M-estimator 

.087 

1.42 

NYC Police 

MLE (Locfdr) 

.121 

1.39 

Penalized mixture model (Mixfdr) 

.080 

1.42 
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